Summary 


Long Interspersed Element-1 (LINE-1 or L1) sequences comprise the bulk of retrotransposition 
activity in the human genome; however, the abundance of highly active or ‘hot’ L1s in the human 
population remains largely unexplored. Here, we used a fosmid-based, paired-end DNA 
sequencing strategy to identify 68 full-length L1s which are differentially present among 
individuals but are absent from the human genome reference sequence. The majority of these L1s 
were highly active in a cultured cell retrotransposition assay. Genotyping 26 elements revealed 
that two L1s are only found in Africa and that two more are absent from the H952 subset of the 
Human Genome Diversity Panel. Therefore, these results suggest that ‘hot’ L1s are more abundant 
in the human population than previously appreciated, and that ongoing L1 retrotransposition 
continues to be a major source of inter-individual genetic variation. 


Introduction 


L1s comprise ~17% of human DNA and have been an instrumental force in shaping genome 
architecture (Lander et al., 2001). Most L1s are molecular fossils that cannot move 
(retrotranspose) to new genomic locations (Grimaldi and Singer, 1983; Lander et al., 2001). 
However, a small number of human-specific L1 (L1Hs) elements remain retrotransposition- 
competent (Badge et al., 2003; Brouha et al., 2003; Sassaman et al., 1997). On occasion, 
their retrotransposition has resulted in sporadic cases of human disease (reviewed in 
Babushok and Kazazian, 2007; Kazazian et al., 1988). 


During the past fifteen years, computational, molecular biological, and genomic approaches 
have been used to identify and characterize L1Hs elements (Badge et al., 2003; Boissinot et 
al., 2000; Boissinot et al., 2004; Brouha et al., 2003; Lander et al., 2001; Moran et al., 1996; 
Myers et al., 2002; Ovchinnikov et al., 2001; Sheen et al., 2000; Xing et al., 2009). Several 
themes have emerged from these studies. First, L1Hs elements can be stratified into several 
subfamilies (pre-Ta, Ta-0, Ta-1, Ta1-d, Ta1-nd) based upon the presence of diagnostic 
sequence variants contained within their 5' and 3’ untranslated regions (UTRs) (Boissinot et 
al., 2000; Skowronski et al., 1988; Smit et al., 1995). Second, many L1Hs elements are 
dimorphic in that they are differentially present in individual genomes and/or are present in 
an individual, but absent from the haploid Human Genome Reference sequence (HGR) 
(Badge et al., 2003; Boissinot et al., 2004; Brouha et al., 2003; Lander et al., 2001; Myers et 
al., 2002; Xing et al., 2009). Third, it has been estimated that the average human genome 
contains ~80-100 active (retrotransposition-competent) L1Hs elements, and that only a 
small number of highly active L1Hs elements (‘hot’ L1s) account for the bulk of 
retrotransposition activity in the HGR (Brouha et al., 2003). Those studies, as well as recent 
efforts to identify insertion, deletion, and inversion polymorphisms (structural variants) in 
humans (Kidd et al., 2008; Korbel et al., 2007; Tuzun et al., 2005; Xing et al., 2009) indicate 
that ongoing L1 retrotransposition contributes to inter-individual genetic variation. 


Here, we employed a fosmid-based, paired-end DNA resource to identify full-length L1Hs 
elements in the genomes of six individuals of diverse geographic origin. Over half (37/68) of 
the newly identified L1s were ‘hot’ for retrotransposition when examined in a cultured cell 
assay (Moran et al., 1996). Genotyping a subset of these L1s further revealed that some are 
likely restricted to Africans, whereas others are absent from the Human Genome Diversity 
Panel (HGDP) (Cann et al., 2002) suggesting that they are present at very low allele 
frequencies. 


An experimental strategy to identify full-length human specific L1s 


To identify novel, full-length L1s in the genomes of geographically diverse individuals, we 
exploited a fosmid-based, paired-end DNA sequencing strategy that previously was used to 
identify structural variants in human DNA (Kidd et al., 2008; Tuzun et al., 2005). Fragments 
of genomic DNA approximately 40kb in size were individually cloned using fosmid vectors 
(see Extended Experimental Procedures). Sequence reads were obtained from both ends of 
each insert (paired-end sequences) and compared to the HGR. End-sequences from genomic 
fragments that do not differ significantly in size from the HGR will map ~40kb away from 
each other. In contrast, paired-end sequences derived from genomic fragments containing a 
full-length, dimorphic ~6kb L1Hs element will be separated by ~34kb when mapped to the 
HGR (Figure 1) (Tuzun et al., 2005). In general, the predicted variants were required to be 
supported by two fosmid clones containing putative insertions from the same individual. The 
size cutoffs used in our screening protocols are biased to allow the identification of full- 
length or near full-length L1 insertion polymorphisms, but not severely 5’ truncated L1 
sequences, which are replication-deficient (Table 1). Through this scheme, we should be 
able to identify the bulk of full-length L1s in an individual genome that are dimorphic when 
compared to the HGR. 
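
The size logic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study: the function names, the tuple layout for clone calls, and the example site labels are hypothetical. It shows how a ~40kb insert whose ends map ~34kb apart implies ~6kb of extra sequence, and how a two-clone support requirement might be applied per individual.

```python
from collections import Counter

def apparent_insertion_kb(expected_insert_kb, mapped_span_kb):
    """Extra sequence implied when end reads map closer together than the insert size."""
    return expected_insert_kb - mapped_span_kb

def supported_sites(clone_calls, min_clones=2):
    """Keep (individual, site) pairs seen in at least `min_clones` clones.

    `clone_calls` is a hypothetical list of (individual, site_label) tuples,
    one entry per fosmid clone that looks like it carries an insertion.
    """
    counts = Counter(clone_calls)
    return sorted(key for key, n in counts.items() if n >= min_clones)

# A ~40kb insert mapping ~34kb apart suggests a ~6kb insertion (full-length L1 size).
print(apparent_insertion_kb(40.0, 34.0))

calls = [("NA18956", "siteA"), ("NA18956", "siteA"), ("NA12878", "siteB")]
print(supported_sites(calls))
```

Only the site seen in two clones from the same individual survives the filter; the singleton is dropped.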


Fosmids fulfilling the above mapping criterion were subjected to a series of screens (Figure 
1). First, allele-specific oligonucleotide hybridization using probes directed against 
diagnostic sequences in the L1Hs 5' UTR identified insertion fosmids that contain putative 
dimorphic L1Hs elements (Boissinot et al., 2000; Tuzun et al., 2005). Second, Southern 
blotting with a probe directed against the 5' UTR of L1.3 (Accession# L19088) enabled the 
identification of fosmids that contained putative full-length L1Hs elements (Dombroski et 
al., 1993; Sassaman et al., 1997). Third, a suppression PCR-based method (ATLAS) (Badge 
et al., 2003) and/or direct sequencing was used to verify the presence of a full-length (or 
near full-length) L1Hs element in the fosmid. Finally, genomic sequences flanking the 5’ 
and 3’ ends of the newly identified L1Hs elements were used as probes in BLAT searches 
(http://genome.ucsc.edu/cgi-bin/hgBlat?command=start) (Kent, 2002) to confirm that the L1 
was absent from the HGR (NCBI build 36.1/hg18). Flanking sequences also were used to 
determine whether any of the L1Hs elements were present in a database of known 
polymorphic retrotransposon insertions (dbRIP; http://dbrip.brocku.ca/) (Wang et al., 2006). 
Two additional L1Hs elements were identified through direct sequencing of the fosmids 
(#1-2-1 and 10-2-1). 


Identification of full-length L1Hs elements from geographically diverse individuals 


We first conducted a pilot study to examine a fosmid library from a female individual 
(G248; NA15510) for full-length L1Hs insertions (Table 1) (Tuzun et al., 2005). Despite the 
fact that this library was optimized for identifying ~8kb insertion polymorphisms as part of 
the Human Genome Structural Variation project (HGSV) (Kidd et al., 2008; Tuzun et al., 
2005), we were able to identify five novel L1Hs elements using our screening protocol 
(Table 1). 


The above data provided ‘proof of principle’ that our strategy was effective for identifying 
full-length, dimorphic L1Hs elements. Thus, we next screened fosmid libraries from five 
females representing four distinct geographic populations that were studied as part of the 
HapMap project (one Japanese (NA18956), one Chinese (NA18555), one Western European 
CEPH (NA12878), and two Yoruban individuals (NA19240, NA19129)) (Consortium, 
2005; Kidd et al., 2008). Size cutoffs allowed detection of insertion polymorphisms as small 
as ~4.2-5.5kb and enabled the identification of an additional 64 L1Hs elements (Table 1) 
(Kidd et al., 2008). As our strategy is biased toward finding novel, full-length L1s, we 
generally observed a decrease in the number of L1Hs elements identified in each successive 
library screen (e.g., ABC13 was the last library analyzed and contained relatively few novel 
L1Hs elements). In total, we identified 69 L1Hs elements that were absent from the HGR, 
one of which was identified in two different individuals (#4-1 and 5-77, respectively). This 
element also was completely annotated in dbRIP, unlike 65 of the distinct 68 L1s identified 
in this study (Table 1). The number of elements discovered at each stage of the analysis is 
detailed in the Extended Experimental Procedures. 


Many of the newly identified L1Hs elements are ‘hot’ for retrotransposition 


We next tested if the L1Hs elements identified in our screens were active for 
retrotransposition in cultured cells. Sixty-seven elements were cloned into either a 
pBluescript and/or pCEP4 L1 expression vector that contained an mneoI retrotransposition 
indicator cassette in its 3’ UTR (#2-42 was refractory to cloning; details in Experimental 
Procedures) (Freeman et al., 1994; Moran et al., 1996). The pBluescript-based L1 constructs 
lack an exogenous promoter; thus, L1 expression is driven from its native 5’ UTR. Elements 
isolated from libraries ABC11-13 were assayed in this context. L1s isolated from the G248, 
ABC9, and ABC10 libraries were assayed in pCEP4 (CMV+/5'UTR+) and/or pBluescript 
(5'UTR+) based contexts. The resultant plasmids were transfected into HeLa cells and 
successful retrotransposition events were detected as G418-resistant foci (Figure 2a) (Moran 
et al., 1996). Retrotransposition activities are reported relative to L1.3, and ‘hot’ refers to an 
L1 that jumps at >10% of L1.3 (see Table S1). Notably, 22 elements yielded similar 
retrotransposition efficiencies relative to L1.3 when tested in either a CMV+/5'UTR+ or a 
5'UTR+ context (data not shown). Since the subcloning procedure does not involve PCR, we 
truly are testing the retrotransposition capability of each of the identified L1Hs elements in 
our screen. 
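
The activity measure used above (colony counts expressed relative to L1.3, with ‘hot’ defined as >10% of L1.3) can be written out as a short sketch. The function names and the example numbers below are hypothetical and are not data from this study; the normalization to transfection efficiency follows the description in the Experimental Procedures, but the exact arithmetic used by the authors is an assumption.

```python
def relative_activity(colonies, efficiency, ref_colonies, ref_efficiency):
    """Percent activity of a test L1 relative to a reference element (e.g., L1.3).

    Colony counts are first divided by transfection efficiency (assumed here
    to be a fraction between 0 and 1), then expressed as a percentage of the
    normalized reference count.
    """
    normalized = colonies / efficiency
    ref_normalized = ref_colonies / ref_efficiency
    return 100.0 * normalized / ref_normalized

def is_hot(percent_of_reference, threshold=10.0):
    """'Hot' as defined in the text: activity above 10% of L1.3."""
    return percent_of_reference > threshold

# Hypothetical counts: 30 colonies for a test element, 200 for the reference.
pct = relative_activity(30, 0.5, 200, 0.5)
print(pct, is_hot(pct))
```

Under this definition an element retrotransposing at ~8% of L1.3 would be scored as active at a low level but not ‘hot’.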


Each individual contained between three and nine ‘hot’ L1s in their genome and 55% 
(37/67) of the L1Hs elements tested were hot for retrotransposition (Figures 2a & 2b, Table 
1). These 37 ‘hot’ L1Hs elements represent an approximately 4-fold increase in the number 
of ‘hot’ L1s identified in previous studies (Badge et al., 2003; Brouha et al., 2002; Brouha et 
al., 2003; Kimberland et al., 1999; Lander et al., 2001; Sassaman et al., 1997). Examination 
of the 3’ UTR sequences of the 68 L1s uncovered six elements that contain an ACG in place 
of the Ta subfamily diagnostic ACA characters. These elements are termed ‘pre-Ta’, and 
represent an older L1 subfamily (Boissinot et al., 2000; Brouha et al., 2003; Kazazian et al., 
1988; Lander et al., 2001; Myers et al., 2002; Skowronski et al., 1988). Two pre-Ta L1s 
(#3-5 and 5-55) were ‘hot’ for retrotransposition (Figure 2b; Table S1). These data agree 
with previous studies, which showed that a de novo insertion of a pre-Ta L1 into the Factor 
VIII gene resulted in a sporadic case of hemophilia A (Kazazian et al., 1988). 
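
The 3’ UTR diagnostic described above (ACA for the Ta subfamily, ACG for pre-Ta) lends itself to a simple lookup. The sketch below is a minimal illustration: the function name is hypothetical, and the caller is assumed to have already extracted the trinucleotide at the diagnostic position, whose coordinates are not given in this text.

```python
def classify_by_3utr_trinucleotide(trinucleotide):
    """Assign a coarse subfamily label from the 3' UTR diagnostic characters.

    Only the two states named in the text are handled; anything else is
    reported as unclassified rather than guessed.
    """
    tri = trinucleotide.strip().upper()
    if tri == "ACA":
        return "Ta"
    if tri == "ACG":
        return "pre-Ta"
    return "unclassified"

for tri in ("ACA", "acg", "GAG"):
    print(tri, classify_by_3utr_trinucleotide(tri))
```

Finer assignments (e.g., Ta-0 versus Ta-1) depend on additional diagnostic positions that are not covered by this sketch.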


Hallmarks and insertion locations of L1s identified in this study 


We next sequenced each L1Hs element in its entirety and compared these data to fosmid 
sequences previously deposited in GenBank (Kidd et al., 2008). We annotated each L1 for 
hallmarks of retrotransposition as well as their chromosomal environment (Table S2). In 
general, the L1Hs elements were flanked by target-site duplications that ranged from 6 to 
20bp, inserted into an L1 endonuclease consensus cleavage sequence (Cost and Boeke, 
1998; Feng et al., 1996; Morrish et al., 2002), and their 3' ends had either homopolymeric 
poly (A) tails that ranged from ~8-41bp in size or interrupted poly (A) tails/3’ transductions 
ranging from ~18bp to 1,105bp in length (Table S2) (Goodier et al., 2000; Holmes et al., 
1994; Moran et al., 1999; Pickeral et al., 2000). 


A subset of the elements (~32/68) contained an additional 1-14bp of untemplated 
nucleotides at their 5' ends, termed 5’ end heterogeneity (Athanikar et al., 2004; Lavie et al., 
2004). Five of these L1s have an extra G at their 5’ ends, and one has three extra Gs when 
compared to a ‘hot’ L1Hs consensus sequence (Brouha et al., 2003). These extra nucleotides 
potentially could result either from a terminal transferase activity associated with the L1 
reverse transcriptase, or reverse transcription of the 7-methylguanosine cap at the 5’ end of 
L1 RNA (Boeke, 2003; Gilbert et al., 2005; Symer et al., 2002). The majority of elements 
identified were full-length; however, we also found 7 elements (e.g. #1-5 and 2-30) that 
were truncated within their 5’ UTR. These data, along with the fact that the fosmid libraries 
provided ~4—5 fold coverage of each haplotype from the 6 individuals (Kidd et al., 2008), 
indicate that our screening procedure identified the majority of the full-length L1s in these 
genomes. 
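
A rough calculation illustrates why ~4-5 fold clone coverage per haplotype is compatible with recovering most insertions when two supporting clones are required. The sketch assumes that the number of clones spanning a site follows a Poisson distribution; this model is an assumption made here for illustration and is not stated in the text, and the function name is hypothetical.

```python
import math

def prob_at_least_k_clones(mean_coverage, k=2):
    """P(at least k clones span a site) under an assumed Poisson coverage model."""
    p_fewer = sum(
        math.exp(-mean_coverage) * mean_coverage ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - p_fewer

for coverage in (4.0, 5.0):
    print(coverage, round(prob_at_least_k_clones(coverage), 3))
```

Under this simplified model, roughly nine in ten heterozygous sites would be covered by two or more clones, before accounting for the size cutoffs and downstream screening steps.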


The 68 L1Hs elements were dispersed throughout the genome. We did not identify L1Hs 
elements on chromosomes 16 or 19 (Figure 2c); however, this result probably reflects our 
small sample size rather than a systematic bias against their ability to insert on these 
chromosomes (Lander et al., 2001). Consistently, we previously were able to detect the 
insertion of engineered L1s into chromosomes 16 and 19 of HeLa cells (Gilbert et al., 2005). 


Approximately 32% (22/68) of L1Hs elements were present in the introns of known RefSeq 
genes (http://www.ncbi.nlm.nih.gov/RefSeq/), and mutations in several of these genes are 
implicated in human genetic disorders (Table S3). Thirteen L1 insertions were in the anti- 
sense orientation (i.e., were transcribed in the opposite orientation to the gene), whereas 9 
L1 insertions were in the same transcriptional orientation as the gene. Since ~26-38% of the 
genome is spanned by genes (Venter et al., 2001), the data suggest that the L1s have inserted 
randomly with respect to gene content, which is in agreement with previous studies (Gilbert 
et al., 2005; Gilbert et al., 2002; Ovchinnikov et al., 2001; Symer et al., 2002). 
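
One way to formalize the comparison above is an exact binomial test of the observed intronic count (22 of 68) against the genic fraction of the genome (~26-38%), and of the orientation split (13 antisense of 22) against an even expectation. No such test is reported in the text; the sketch below is offered only as an illustration, and the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_two_sided_p(k, n, p):
    """Exact two-sided p-value: sum of outcomes no more likely than the observed one."""
    observed = binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        prob = binom_pmf(i, n, p)
        if prob <= observed * (1 + 1e-9):  # small tolerance for float ties
            total += prob
    return min(1.0, total)

# Intronic insertions versus the lower and upper genic-fraction estimates.
for genic_fraction in (0.26, 0.38):
    print(genic_fraction, binom_two_sided_p(22, 68, genic_fraction))

# Antisense versus sense orientation among the 22 intronic insertions.
print(binom_two_sided_p(13, 22, 0.5))
```

In this sketch none of the three comparisons departs significantly from expectation at the 0.05 level, which is in line with the interpretation given in the text.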


Our sequencing studies uncovered several expected trends and some unexpected results. All 
37 ‘hot’ L1 elements and the 6 low-level activity elements had two intact open reading 
frames (ORFs). A consensus sequence derived from these 37 ‘hot’ L1s was identical at the 
amino acid level to a previously derived consensus (Brouha et al., 2003). 


Inactive elements generally had frame shift (5/24) or chain-terminating nonsense mutations 
(9/24) in at least one of the L1 ORFs. However, 10 of these low-level activity or inactive 
elements contained two intact open reading frames. One L1 (#3-24) contained an S228P 
missense mutation within the endonuclease (EN) domain of ORF2p (Feng et al., 1996; 
Weichenrieder et al., 2004). Though L1s containing EN mutations are unable to 
retrotranspose in HeLa cells, they can retrotranspose in Chinese Hamster Ovary (CHO) cells 
deficient in the non-homologous end-joining (NHEJ) pathway of DNA repair, presumably 
by parasitizing a free 3' OH group to initiate target-primed reverse-transcription (TPRT) 
(Morrish et al., 2007; Morrish et al., 2002). Interestingly, although #3-24 is inactive in 
NHEJ proficient cell lines, the L1 retrotransposed at roughly 60% the efficiency of the wild- 
type control, L1.3, in NHEJ deficient CHO cells (Morrish et al., 2002). Introducing the 
S228P change into L1.3 (Sassaman et al., 1997) also allowed efficient EN-independent 
retrotransposition, indicating that this mutation is largely responsible for the inactivity of 
#3-24 in HeLa cells (Figure S1). 


Analysis of genomic sequences flanking the 68 L1Hs elements revealed a number of 
interesting findings. The poly (A) tails of 25 L1s were interrupted or contained 3’ 
transductions (Goodier et al., 2000; Holmes et al., 1994; Moran et al., 1999; Pickeral et al., 
2000), seventeen of which clustered into ‘subfamilies’ of L1Hs elements. In one case, we 
identified an L1 (#2-1) as the likely source element for one of these ‘subfamilies’. For #1-3, 
3-31, and 1-5, these transductions/interrupted poly (A) tails were identical to those in L1Hs 
elements that have caused disease-producing mutations (e.g., L1RP, LRE3) (Brouha et al., 
2002; Kimberland et al., 1999). In other cases, the transductions denote examples of recently 
amplified subfamilies (Goodier et al., 2000; Lander et al., 2001; Pickeral et al., 2000). 


Examining the 5’ genomic flanks showed that the retrotransposition of a full-length L1 from 
the ABC9 genomic library (#2-24) that integrated on chromosome 10 was accompanied by 
~250bp of an Alu element which maps to chromosome 16. The Alu sequence is in the 
opposite transcriptional orientation to the L1, 13bp of unmapped sequence separates the 
elements, and the whole insertion was flanked by target site duplications (TSDs) (Figure 
S2). Thus, though most of the full-length L1Hs elements identified here have amplified by 
canonical retrotransposition, recombination and/or replication-mediated repair processes 
may facilitate the integration of some elements (Gilbert et al., 2005; Gilbert et al., 2002; 
Symer et al., 2002). Additionally, our screen allowed us to resolve possible sequence 
anomalies in the HGR. For example, one fosmid that lacks a dimorphic L1Hs element 
(#6-105) actually contains two L1s (a PA2 and pre-Ta element) that likely were collapsed 
into a harlequin element during the HGR assembly (Figure S2). 


Finally, the data also enabled us to examine allelic heterogeneity associated with L1Hs 
elements. For example, one L1 (#5-70) was present in the HGR, but contained a stop codon 
in ORF2 and was not tested for activity (Brouha et al., 2003). Interestingly, #5-70 
retrotransposed at ~8% of the level of L1.3, further illustrating how allelic heterogeneity can 
impact retrotransposon activity (Lutz et al., 2003; Seleme et al., 2006). 


Allele frequencies of genotyped elements 


The 68 L1Hs elements identified here are dimorphic with respect to presence; thus, we 
tested if a subset of these L1s represented population-restricted or potentially private alleles. 
To address this question, we first compiled existing genotyping data (Badge et al., 2003; 
Myers et al., 2002; Xing et al., 2009). Additional genotyping then was conducted on a subset 
of the L1s discovered here (26 in total; see Supplemental Information for selection criteria). 
The 26 L1s first were genotyped in a CEPH panel of 129 unrelated individuals. Nine L1s 
absent from the CEPH panel then were genotyped in a Zimbabwean panel of 72 unrelated 
individuals. Finally, if the element was absent from both panels, it was genotyped on the 
H952 subset of the HGDP consisting of ~1050 individuals from ~51 worldwide populations 
(Figure 3a and Table S4) (Cann et al., 2002; Rosenberg, 2006). 
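
The tiered genotyping scheme described above amounts to a simple escalation rule: move to the next panel only if the insertion was not seen in the previous one. The sketch below illustrates that rule; the panel labels, the function name, and the dictionary input are hypothetical conveniences rather than part of the published protocol.

```python
PANEL_ORDER = ("CEPH", "Zimbabwean", "HGDP-H952")

def genotyping_path(presence_by_panel):
    """Return the panels that would be typed, in order, for one element.

    `presence_by_panel` maps a panel label to True if the insertion allele
    was observed in that panel. Typing stops at the first panel in which
    the element is found; a missing entry is treated as 'not observed'.
    """
    path = []
    for panel in PANEL_ORDER:
        path.append(panel)
        if presence_by_panel.get(panel, False):
            break
    return path

print(genotyping_path({"CEPH": True}))
print(genotyping_path({"CEPH": False, "Zimbabwean": True}))
print(genotyping_path({"CEPH": False, "Zimbabwean": False, "HGDP-H952": False}))
```

An element that travels the full path without being observed is a candidate for a very rare or private allele.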


Two elements (#3-5 and 3-31) genotyped on the HGDP exist at very low allele frequencies 
and were only found in Africans. Two other L1Hs elements (#1-5 and 3-24) were absent 
from the HGDP (Table S4). Element #3-24 (the S228P mutant described above) was found 
in the ABC10 Yoruban library. Further genotyping revealed that the L1Hs element 
containing the mutation was present in her mother (but not her father), excluding a de novo 
origin (Figure 3b). The other putatively ‘private’ L1Hs element was from G248 (#1-5), so 
we could not examine its segregation in a trio. Interestingly, this L1 insertion occurred into 
an intron of the ABCA1 gene (Figure 3c); mutations in ABCA1 have been associated with 
Tangier disease and low serum HDL levels (Frikke-Schmidt, 2009). 
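
For reference, the allele frequency of an insertion in a genotyped panel follows directly from the genotype counts. The sketch below uses the standard diploid formula; the function name and the example counts are hypothetical and are not results from this study.

```python
def insertion_allele_frequency(n_hom_present, n_het, n_hom_absent):
    """Frequency of the insertion allele among all chromosomes typed.

    Each individual contributes two chromosomes; homozygotes for the
    insertion contribute two copies and heterozygotes one.
    """
    n_individuals = n_hom_present + n_het + n_hom_absent
    if n_individuals == 0:
        raise ValueError("no genotyped individuals")
    return (2 * n_hom_present + n_het) / (2 * n_individuals)

# Hypothetical panel: one heterozygous carrier among 129 individuals.
print(insertion_allele_frequency(0, 1, 128))
```

A single heterozygous carrier in a panel of this size corresponds to an allele frequency well below 1%, which is the range in which an element can easily be missed in a smaller panel.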


The total number of active L1Hs elements present in ABC13 


To estimate the total number of active L1s in one individual, we carried out in silico 
genotyping of the 68 L1Hs elements in ABC13, the last library examined in our subtractive 
scheme. We identified 20 regions containing distinct L1 insertions identified in the first 5 
individuals that corresponded to insertion fosmids in the ABC13 HGSV track 
(http://hgsv.washington.edu/) of the UCSC genome browser (Figure 4a, Table S4) (Kent et 
al., 2002; Kidd et al., 2008). PCR genotyping confirmed that ABC13 contained 18 of these 
20 elements (Figure 4b), and was homozygous with respect to presence for three of the 
elements. This result suggests that in silico genotyping could be used as a screening tool to 
identify L1Hs elements present at low allele frequencies in the population (Table S4). 


Adding the 18 L1Hs elements identified by in silico genotyping to the seven novel L1Hs 
elements identified in the ABC13 genome through our fosmid screens revealed that this 
individual contains 25/68 L1Hs elements identified in this study. Additional genotyping 
revealed that this individual contains 2 of the ‘hot’ L1s characterized in a previous study 
(Table 1) (Brouha et al., 2003). Combining these numbers with our retrotransposition data 
indicates that the ABC13 genome contains 14 potentially ‘hot’ L1Hs elements, and that at 
least 3 of these elements are present in a homozygous state. 


Estimates of L1 age 


Our data suggest that, on average, the 68 L1Hs elements identified here are present at lower 
allele frequencies, are more active, and may be evolutionarily younger than those in 
previous studies (Brouha et al., 2003). To test this hypothesis, we derived maximum 
likelihood estimates for the ages of Ta-1 L1Hs elements in our dataset and that of Brouha et 
al. (Brouha et al., 2003; Marchani et al., 2009). This analysis revealed that the Ta-1 L1Hs 
elements identified here are significantly younger (1.0 MY; 95% C.I. 0.98-1.01 MY) than 
those reported previously (2.01 MY; 95% C.I. 2.00-2.02 MY (Marchani et al., 2009) and 1.73 
MY; 95% C.I. 1.69-1.77 MY (Brouha et al., 2003)). 


The maximum likelihood estimated age (Marchani et al., 2009) (1.0 MY) of the L1s 
reported here differs significantly from that calculated using the ad hoc method, which uses 
sequence divergence within subfamilies of elements to determine age (Carroll et al., 2001) 
(1.18 MY old). These two methods are known to be respectively robust (the maximum 
likelihood method) and sensitive (the ad hoc method) to the presence of multiple active 
lineages in the dataset (i.e. departures from the master gene model of L1 evolution) 
(Cordaux et al., 2004). The difference in these two estimates may indicate that members of 
multiple active L1Hs subfamilies are present in our dataset, and suggests that the true age of 
the L1s may be younger than either calculation suggests. Indeed, the above data are 
consistent with the hypothesis that the HGR is strongly biased in favor of older, fixed L1Hs 
elements. 


We next used a neighbor joining approach, rooted with an intact chimpanzee L1 element, to 
generate a phylogenetic tree of the 68 full-length L1Hs elements (Figure 5, see Extended 
Experimental Procedures). As predicted, pre-Ta elements were located near the root of the 
tree. Interestingly, two known (L1RP & LRE3) and five other currently amplifying 
‘subfamilies’ clustered together on the tree (Figure 5; see groups of colored elements), even 
though the interrupted poly (A) tail/transduction sequences themselves were excluded from 
the sequence alignments. 


Discussion 


We have developed a systematic process to identify novel, dimorphic, active L1Hs elements 
in genomes of individuals from diverse geographic populations. Many of the newly 
identified L1Hs elements exist at low allele frequencies in the population and four L1Hs 
elements represent ‘rare’ alleles, three of which appear to be restricted to Africans. 
Sequence-based age estimates further reveal that these L1Hs elements appear to be, on 
average, evolutionarily younger than those identified in previous studies (Brouha et al., 
2003; Marchani et al., 2009). These data are consistent with the notion that full-length active 
L1s are systematically underrepresented in available genome reference sequences (Badge et 
al., 2003; Boissinot et al., 2004; Brouha et al., 2003; Sassaman et al., 1997; Sheen et al., 
2000; Xing et al., 2009). 


Our study has underscored the effectiveness of fosmid paired-end libraries in the discovery 
of novel, active L1Hs elements. Though a number of technologies have been developed to 
identify polymorphic L1s (Badge et al., 2003; Boissinot et al., 2004; Brouha et al., 2003; 
Moran et al., 1996; Myers et al., 2002; Sheen et al., 2000; Xing et al., 2009), the approach 
described here is not reliant upon PCR fidelity, readily allowing the identification of active 
L1Hs elements and making sequencing of genomic flanking sequences, poly (A) tails, and 
L1-mediated transductions relatively straightforward. Thus, we predict that the fosmid-based 
approach likely will be superior to second-generation, low-coverage genome sequencing 
methodologies (e.g., many individual genomes characterized in the 1000 genomes project; 
http://www.1000genomes.org/page.php) for comprehensively identifying and characterizing 
‘rare’ L1 alleles in individual genomes. Indeed, recently published genome sequences 
highlight the difficulties in detecting and unambiguously mapping highly repetitive 
insertions (relative to a reference genome), including L1Hs elements (Bentley et al., 2008; 
McKernan et al., 2009; Wang et al., 2008; Wheeler et al., 2008). 


Our analysis revealed that many active L1s cluster in small ‘subfamilies’. In the strictest 
sense, these data argue against a master gene model (Deininger et al., 1992) and instead 
support a model in which multiple active source L1Hs elements (including members of both 
the pre-Ta and Ta-subfamilies) are currently retrotransposing in modern human genomes 
(Cordaux et al., 2004). We cannot formally exclude a ‘stealth’ model, where L1s in 
unfavorable expression contexts sometimes give rise to new retrotransposition-competent 
source elements that can be expressed from a more favorable genomic context (Han et al., 
2005). However, the most parsimonious explanation of our data is that multiple source L1Hs 
elements and subfamilies with limited ‘life-spans’ exist in the genome. We posit that ‘hot’ 
L1Hs elements must give rise to new, active progeny at a faster rate than they are inactivated 
by cellular mutational processes (see Figure 6 for model); this can lead to a scenario where 
small numbers of currently active L1Hs lineages may out-compete older L1s for limiting 
reagents, such as host factors (Boissinot and Furano, 2001). This competition scenario both 
supports and extends current lineage succession models and could potentially explain the 
monophyletic history of L1s and the appearance of a replication-dominant L1Hs subfamily 
(Boissinot et al., 2000; Cordaux et al., 2004; Seleme et al., 2006). 


Our data set is still relatively small, and it remains difficult to estimate the actual number of 
‘hot’ L1s in the extant population. However, our ability to readily identify rare ‘hot’ L1s in 
the genomes of geographically diverse individuals strongly suggests that these highly active 
L1Hs elements are more abundant in the population than previously appreciated. The active 
L1Hs elements identified here also have the potential to impact modern human genomes by 
retrotransposing flanking genomic sequences to new chromosomal locations and by serving 
as substrates for non-allelic homologous recombination (reviewed in Cordaux and Batzer, 
2009; Moran et al., 1999). The proteins encoded by these L1s also may promote the 
retrotransposition of Alu elements and non-coding RNAs (Bennett et al., 2008; Dewannieux 
et al., 2003; Garcia-Perez et al., 2007). Indeed, our data support the hypothesis that ‘hot’ L1s 
are actively retrotransposing in modern-day human genomes and suggest that some of the 
L1 alleles identified here could serve as source elements for disease-producing L1 
insertions. 


Experimental Procedures 


Creation of Fosmid Libraries and Identification of Insertion-Containing Fosmids 


Genomic DNA from the 6 individuals was obtained from transformed lymphoblastoid cell 
lines (available from the Coriell Cell Repository). The DNA was hydrodynamically sheared, 
end-repaired, size selected for 40kb fragments by pulsed field gel electrophoresis, and 
ligated into fosmid vectors (Donahue and Ebling, 2007). Agencourt Biosciences Corporation 
constructed all libraries, with the exception of the G248 library, which was constructed as 
part of the human genome project finishing effort. From each library, approximately 1 
million individual cloned fragments were arrayed into 384-well plates. End-sequence pairs 
were obtained from both ends of each DNA fragment using standard capillary sequencing 
and were mapped back to the HGR. Insertion-containing fosmids were identified as the 
subset of fosmids containing an apparent insert that was ~3 standard deviations smaller than 
the library mean (Kidd et al., 2008; Tuzun et al., 2005). 
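
The size criterion described above can be illustrated with a short sketch that flags clones whose mapped span falls more than three standard deviations below the library mean. This is a simplified illustration, not the published pipeline: the function name and the toy span values are hypothetical, and the mean and standard deviation are estimated here from the same list of spans, which may differ from how the library statistics were derived in the original studies.

```python
from statistics import mean, stdev

def flag_insertion_clones(mapped_spans_kb, n_sd=3.0):
    """Return indices of clones whose mapped span is unusually short.

    A short span relative to the library mean suggests the clone carries
    sequence that is absent from the reference.
    """
    library_mean = mean(mapped_spans_kb)
    library_sd = stdev(mapped_spans_kb)
    cutoff = library_mean - n_sd * library_sd
    return [i for i, span in enumerate(mapped_spans_kb) if span < cutoff]

# Toy library: most clones map ~39-41kb apart; the last one maps ~34kb apart.
spans = [40.0] * 50 + [39.0, 41.0] * 25 + [34.0]
print(flag_insertion_clones(spans))
```

Only the final clone in the toy library is flagged; clones within the normal spread of insert sizes are left alone.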


Screening of Fosmid Clones for LINE-1 Insertions 


Insertion-containing fosmids identified in silico were screened for L1Hs elements in the 
following manner. First, all insertion fosmids were subjected to allele-specific 
oligonucleotide hybridization to identify characters in the 5' UTRs of newer L1 subfamilies 
(Badge et al., 2003; Boissinot et al., 2000). This protocol was adapted from ‘hybridization of 
bacterial DNA on filters’ (Sambrook, 1989). Fosmid DNAs were prepared according to the 
Very Low-Copy Plasmid/Cosmid Purification protocol for the Qiagen-tip 100 Midi prep kit 
(Qiagen). Those DNAs were subjected to Southern blotting followed by ATLAS (Badge et 
al., 2003) and/or direct sequencing to identify L1Hs elements that were absent from the 
HGR. Sequences flanking the L1Hs elements then were used as probes in BLAT searches at 
the UCSC genome browser (http://genome.ucsc.edu/) to determine the insertion site in the 
HGR (Kent, 2002; Kent et al., 2002). Detailed protocols for each step of the screening 
process, as well as the number of fosmids positive at each stage of the analysis, can be found 
in the Extended Experimental Procedures. 


Cloning of L1s 


In general, L1Hs elements were cloned directly from insertion-containing fosmids by 
digestion with AccI (Sassaman et al., 1997). The restricted DNA was separated on a 0.8% 
agarose gel, and the ~6kb L1-containing restriction fragment was cloned into an L1 
expression vector. This method captures the vast majority of the L1Hs sequence, leaving 
only the first ~35bp and last ~50bp of the original L1 5’ and 3’ UTRs present in the cloning 
vector, respectively. One element, #2-42, was refractory to this cloning procedure, as it 
contains a polymorphism near the 3’ end of ORF2 that creates an additional AccI site. The 
PDH L1.3 mutant was generated by site-directed mutagenesis. Each L1Hs element was 
sequenced in its entirety. Detailed protocols for the creation of each construct are included in 
the Extended Experimental Procedures. 


L1 Retrotransposition Assays 


We used a modification of a transient transfection protocol to conduct retrotransposition 
assays in HeLa and CHO cells (Moran et al., 1996; Morrish et al., 2002; Wei et al., 2000). 
Briefly, cells in 6-well dishes were transfected using the Fugene 6 agent (Roche) with 1µg of 
plasmid (containing the indicator cassette) per well. Cells were fed with media ~24 
hours post plating, and daily from 72 hours with media plus either 400µg/mL G418 or 10µg/mL 
blasticidin. Fourteen days post transfection, cells were fixed and stained with 0.1% 
crystal violet. Colonies were counted in the appropriate wells, and these counts were 
normalized to GFP transfection efficiency. Detailed protocols for culture and assay 
conditions are found in the Extended Experimental Procedures. 
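The normalization step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' analysis script: the function names and the example counts are hypothetical, and expressing activity as a percentage of a reference element follows the "% relative to L1.3" convention used in the Figure 5 legend.

```python
def normalized_colonies(colony_count, gfp_fraction):
    """Divide a drug-resistant colony count by the fraction of GFP-positive cells.

    gfp_fraction is the transfection efficiency measured in a parallel
    GFP transfection (assumed here to be a value in (0, 1]).
    """
    if not 0 < gfp_fraction <= 1:
        raise ValueError("gfp_fraction must be in (0, 1]")
    return colony_count / gfp_fraction


def percent_of_reference(test_normalized, reference_normalized):
    """Express a normalized colony count as a percentage of a reference element."""
    if reference_normalized <= 0:
        raise ValueError("reference_normalized must be positive")
    return 100.0 * test_normalized / reference_normalized


# Hypothetical counts, for illustration only.
test_element = normalized_colonies(colony_count=120, gfp_fraction=0.5)
reference_l1_3 = normalized_colonies(colony_count=150, gfp_fraction=0.25)
print(percent_of_reference(test_element, reference_l1_3))
```

Dividing by the GFP-positive fraction keeps wells with different transfection efficiencies comparable before the ratio to the reference is taken.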


Genotyping and Panels 


The genomic locations of L1Hs insertions were compared to a database of human 
retrotransposon insertion polymorphisms (dbRIP; http://dbrip.brocku.ca/) (Wang et al., 
2006). PCR genotyping assays were designed for a subset of L1Hs elements that were not 
completely annotated in dbRIP. Genotyping initially was conducted on a CEPH panel of 129 
unrelated individuals of Northern European ancestry. If an L1Hs element was absent from the 
CEPH panel, it was genotyped on a panel containing genomic DNAs from 72 unrelated 
Zimbabwean individuals. Finally, if an L1Hs element was absent from both genotyping 
panels, it was genotyped on the H952 subset (Rosenberg, 2006) of the HGDP (Cann et al., 
2002) (see Figure 3b). In silico genotyping was conducted using the HGSV track of the 
UCSC genome browser (Kent et al., 2002; Kidd et al., 2008). Details about these analyses 
are in the Extended Experimental Procedures. 
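As an illustration of how an allele frequency could be derived from such PCR genotyping results, the sketch below counts filled (insertion-containing) alleles across a panel of diploid individuals. The function name and the genotype counts are hypothetical; the source does not state how allele frequencies were computed, so this is simply the standard allele-counting formula.

```python
def insertion_allele_frequency(n_filled_homozygous, n_heterozygous, n_empty_homozygous):
    """Frequency of the filled (L1-containing) allele in a diploid panel.

    Each homozygous-filled individual carries two insertion alleles and
    each heterozygote carries one; the denominator is two alleles per
    genotyped individual.
    """
    n_individuals = n_filled_homozygous + n_heterozygous + n_empty_homozygous
    if n_individuals == 0:
        raise ValueError("at least one genotyped individual is required")
    filled_alleles = 2 * n_filled_homozygous + n_heterozygous
    return filled_alleles / (2 * n_individuals)


# Hypothetical genotype counts for a panel of 129 individuals.
print(insertion_allele_frequency(n_filled_homozygous=3,
                                 n_heterozygous=20,
                                 n_empty_homozygous=106))
```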


Estimation of L1 Element Age 


Sequences of the 69 full-length L1 elements were classified into subfamilies using the 
L1Xplorer analysis website (Penzkofer et al., 2005). Ta-1, Ta-0 and Non-Canonical (NC) 
(Brouha et al., 2003) elements were separately aligned using Muscle 3.52 (Edgar, 2004) on 
the Phylemon web server (http://phylemon.bioinfo.cipf.es/cgi-bin/home.cgi) (Tarraga et al., 
2007). Raw alignments were manually refined to remove all indels, all variable CpG sites 
and the L1 polypurine tract using Jalview (Waterhouse et al., 2009). Maximum likelihood 
estimates of the age (T) of each group, the sampling variance of T, and its 95% confidence 
intervals were calculated using the mleT script (Marchani et al., 2009) running under Matlab 
7.2 -2007a (The Mathworks Inc., Natick, MA). The subroutine CountMutations (Marchani 
et al., 2009) was also utilized to calculate the number of substitutions in the datasets to 
enable the “ad hoc” subfamily age estimation method (Marchani et al., 2009). 
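One part of the alignment refinement described above, removing columns that contain indels, can be sketched as follows. This is an illustrative stand-in for the manual editing done in Jalview, not the authors' procedure: the function name is hypothetical, and masking of variable CpG sites and of the polypurine tract is not covered.

```python
def drop_gapped_columns(aligned_seqs, gap_chars="-."):
    """Return the aligned sequences with every gap-containing column removed."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if no sequence has a gap character at that position.
    keep = [i for i in range(length)
            if not any(seq[i] in gap_chars for seq in aligned_seqs)]
    return ["".join(seq[i] for i in keep) for seq in aligned_seqs]


# Toy alignment, for illustration only.
print(drop_gapped_columns(["ACG-TA", "ACGTTA", "A-GTTA"]))
```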


Phylogenetic Tree 


The sequences of the 69 elements were aligned as described above. Raw alignments were 
manually refined using Jalview (Waterhouse et al., 2009) to remove large indels and 
truncated elements; this led to the exclusion of #6-113 due to a large 5' UTR deletion. 
A single Neighbor Joining tree of the 68 remaining full-length elements was constructed 
using the PHYLIP package (Felsenstein, 1989). Branch lengths were corrected using the 
Kimura 2 parameter model (Kimura, 1980). To assess the reliability of the phylogeny, 1000 
bootstrapped re-samples of the multiple alignment were made using the seqboot program of 
the PHYLIP package (Felsenstein, 1989). The neighbor joining tree derived from the full 
dataset was manually annotated with bootstrap values using Dendroscope (Huson et al., 
2007) (Figure 5). Only bifurcations that occurred in more than 70% of bootstrap re-samples 
are labeled. 
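For readers unfamiliar with the Kimura 2 parameter correction, the sketch below computes the pairwise distance d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. It is a minimal stand-alone illustration with a hypothetical function name; the study itself used the PHYLIP package, whose implementation may differ in details such as the handling of ambiguous characters.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq_a, seq_b):
    """Kimura 2 parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip gaps and ambiguous characters (an assumption of this sketch).
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared      # proportion of transitions
    q = transversions / compared    # proportion of transversions
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


# Toy sequences differing by one transition (A to G) in ten sites.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The correction matters most for more divergent pairs; for very similar sequences, such as young L1Hs elements, the corrected distance stays close to the raw proportion of differences.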


Supplementary Material 


Refer to Web version on PubMed Central for supplementary material. 
Figure 1. A strategy for identifying dimorphic L1Hs elements in individual human genomes 

In silico comparison of the fosmid end sequences (red squares) from individual genomic 
libraries (blue horizontal line) and the HGR (pink horizontal line) enables the detection of 
fosmids that may contain insertions or deletions with respect to the HGR (see dashed lines). 
Insertion fosmids were screened by allele specific oligonucleotide hybridization to detect 
characters that are present in the 5’ UTR of newer L1 elements (one discriminating character 
utilized, a deletion of the G residue at bp 74 in recent L1s, is indicated in maroon). Putative 
L1Hs-containing fosmids were analyzed by Southern blotting with a 3’ UTR probe (blue 
arrow). A representative digest and Southern blot is shown. The ~6kb band is diagnostic for 
the full-length L1. The additional hybridizing band (~1.3kb band liberated from the L1 5’ 
flank in this Southern blot example) serves to distinguish individual fosmids. ATLAS and/or 
DNA sequencing confirmed the presence of a dimorphic, full-length L1Hs insertion. 
Figure 2. L1Hs activity in 6 human genomes 

(a) Cloning strategy: All but one L1Hs element were cloned directly from fosmids using 
AccI sites in their 5’ and 3’ UTRs (red vertical lines; see Extended 
Experimental Procedures). The L1s then were ligated into vectors that either contain or lack 
a CMV promoter (black rectangle). Both vectors contain the mneoI retrotransposition 
indicator cassette (light blue) in the L1 3’ UTR. This cassette allows for detection of 
retrotransposition events in a cell culture retrotransposition assay. SD=splice donor. 
SA=splice acceptor. Active elements confer G418 resistance to HeLa cells, whereas 
defective elements, as illustrated by the RT mutant control (RT- L1), do not. (b) 
Representative G418-resistant foci for the 20 elements from the Yoruban library, ABC10: 
Nine of these elements were highly active (large suns to the left of assay image), and two 
more retained a low level of activity (small suns). One element (#3-5, red box) is a ‘hot’ pre-Ta 
L1 (#3-5 was tested in a pBluescript backbone (5’UTR+); all others were tested in a 
pCEP4 (CMV+/5’UTR+) backbone; see Extended Experimental Procedures). Table S1 displays 
retrotransposition efficiencies for each L1 identified in this study. Figure S1 provides details 
on the EN-deficient element #3-24. (c) The 68 distinct L1Hs elements identified in this 
study and their positions in the genome: Red vertical lines and text represent ‘hot’ or highly 
active elements. Orange vertical lines with black text represent low-level activity elements. 
Blue vertical lines with black text represent dead or inactive elements. The black line 
indicates the one untested element (#2-42). Ideograms were adapted from UCSC genome 
browser: http://genome.ucsc.edu (Kent et al., 2002). 
Figure 3. Allele frequencies of L1Hs alleles in the population 

(a) Genotyping assays: L1s were queried in panels of individuals for their absence (solid 
grey lines), or presence (red line). Genotyping of 26 elements in the three panels allowed the 
discovery of population restricted or potentially ‘private’ L1Hs elements. The expected 
amplicon sizes are diagrammed for element #3-24. (b) Pedigrees showing the inheritance of 
two elements typed in the ABC10 trio: Genotyping gels show the heritability of #3-31 
(African specific) and #3-24 (absent from the HGDP). E and F at the top of the gel image 
indicate PCR results for empty and filled sites. M, F, and C at the bottom of the image 
indicate lanes for the mother, father, and child of the trio. (c) Example data sheet for the 
G248 element #1-5: Empty site: insertion site in the HGR. EN cleavage site: the 
endonucleolytic cleavage site used by L1 EN to initiate retrotransposition. pA length: the 
approximate L1 poly (A) tail length; 3’ transductions and interrupted poly (A) tails also are 
annotated. TSD length: the length of the target site duplication flanking the L1Hs element 
(underlined lettering). Table S2 contains data sheets for each L1 in this study. Table S3 
contains L1Hs insertion locations with respect to genes. Figure S2 displays a non-canonical 
L1Hs insertion and documents a possible sequence anomaly in the HGR. 
Figure 4. An estimate of the number of active L1Hs elements in an individual (ABC13) genome 
(a) In silico genotyping: The last library in our study, ABC13, was examined in silico (see 
text) for the presence of insertion fosmids mapping to the location of L1Hs elements found 
in other individuals. Element 3-17 is used as an example. All blue lines represent insertion 
fosmids in the genomes of the 8 individuals on the HGSV track 
(http://hgsv.washington.edu/) of the UCSC genome browser (http://genome.ucsc.edu) (Kent 
et al., 2002). The ABC7, 8, and 14 libraries were not investigated in this study. (b) PCR 
validation: The elements identified in silico were genotyped using the scheme shown in 
Figure 3 to validate the predictions from the HGSV track of the UCSC browser. Element 3-17 
is used to illustrate the genotyping. ABC10 and ABC13 are heterozygous with respect to 
the L1Hs insertion. ABC11 lacks the L1Hs insertion. Table S4 displays genotyping results 
for all elements in this study. 


Cell. Author manuscript; available in PMC 2011 January 2. 




Beck et al. 


[Figure 5 image: neighbor-joining tree of the full-length L1Hs elements. Tip labels have the 
form index_library_element_subfamily(activity)_AF=value, for example 
31_ABC10_15_TA1(0)_AF=0.430, where the subfamily is TA1, TA0 or NC, the number in 
parentheses is the retrotransposition activity, and AF is the allele frequency where determined. 
Reference sequences on the tree include L19088_L1.3, LRE3, the hot consensus (Brouha et al., 
2003), AL357559 (pre-Ta), AL022171 and BS000022 (PTROG). Numbers at nodes are bootstrap 
counts; brackets at the right mark the RP and LRE3 transduction subfamilies and subfamilies I-V.] 

Figure 5. Phylogenetic tree of the L1Hs elements identified in this study 

The tree is a single neighbor-joining tree (with branch lengths corrected using the Kimura 2 
parameter model of nucleotide substitution) with 68 full-length elements from our study. 
The numbers at particular nodes indicate the number of times that node was observed in 
1000 bootstrap replicates of the dataset. Only bootstrap values exceeding 70% are shown. 
The brackets at the right side indicate previously described ‘transduction subfamilies’ (L1RP 
(labeled RP in the Figure) & LRE3) and distinct L1Hs ‘subfamilies’ currently capable of 
amplifying in human genomes (I-V) (Goodier et al., 2000; Pickeral et al., 2000). Those 
subfamilies are highlighted in the same color to show their clustering on the tree. 
Retrotransposition activity (% relative to L1.3) as well as allele frequency (e.g., AF= 0.012), 
if determined, is appended to the sequence identifiers. Element #11-17 contains ACG 
characters in its 3’ UTR, which are diagnostic for pre-Ta L1s; however, the element clusters 
with the Ta0 subfamily. The tree and age estimates use sequences indicated in the 
Supplemental Information. 
Figure 6. Multiple source loci model for continued L1Hs activity 

An element (source locus) that is both active and in a conducive genomic environment can 
retrotranspose. Shown here is an example of a progenitor element that can be associated 
with subsequent members of a ‘family’ through the use of interrupted poly (A) tails and/or 3' 
transduced sequence (3’ red arrow and line). Distinct elements are marked by distinguishing 
TSDs specific for their new integration site (different colored horizontal arrows). There are 
many of these ‘families’ active in human genomes, such as L1RP, LRE3, and the 5 ‘families’ 
noted in Figure 5. Although host processes (lightning bolt) may inactivate some older 
elements, some of their descendants may retain the ability to retrotranspose and could harbor 
the 3’ transduction/interrupted poly (A) tail. 